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Abstract 

Background: Integrating and analyzing heterogeneous genome-scale data is a huge algorithnnic challenge for 
modern systems biology. Bipartite graphs can be useful for representing relationships across pairs of disparate data 
types, with the interpretation of these relationships accomplished through an enumeration of maximal bicliques. 
Most previously-known techniques are generally ill-suited to this foundational task, because they are relatively 
inefficient and without effective scaling. In this paper, a powerful new algorithm is described that produces all 
maximal bicliques in a bipartite graph. Unlike most previous approaches, the new method neither places undue 
restrictions on its input nor inflates the problem size. Efficiency is achieved through an innovative exploitation of 
bipartite graph structure, and through computational reductions that rapidly eliminate non-maximal candidates from 
the search space. An iterative selection of vertices for consideration based on non-decreasing common 
neighborhood sizes boosts efficiency and leads to more balanced recursion trees. 

Results: The new technique is implemented and compared to previously published approaches from graph theory 
and data mining. Formal time and space bounds are derived. Experiments are performed on both random graphs and 
graphs constructed from functional genomics data. It is shown that the new method substantially outperforms the 
best previous alternatives. 

Conclusions: The new method is streamlined, efficient, and particularly well-suited to the study of huge and diverse 
biological data. A robust implementation has been incorporated into GeneWeaver, an online tool for integrating and 
analyzing functional genomics experiments, available at http://geneweaver.org. The enormous increase in scalability 
it provides empowers users to study complex and previously unassailable gene-set associations between genes and 
their biological functions in a hierarchical fashion and on a genome-wide scale. This practical computational resource 
is adaptable to almost any applications environment in which bipartite graphs can be used to model relationships 
between pairs of heterogeneous entities. 
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Background 

Bicliques have a long history of applications. The enumer- 
ation of maximal bicliques can be traced at least as far 
back as the seminal work reported in [1]. There the prob- 
lem was defined in terms of rectangles, binary relations 
and concept lattices. Subsequent progress on concept 
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lattices was surveyed in [2,3]. Algorithms for their identi- 
fication were applied to the analysis of gene co-expression 
data in [4,5]. 

A variety of biological challenges can be addressed 
by finding maximal bicliques in bipartite graphs. Rep- 
resentative applications include biclustering microarray 
data [6-8], optimizing phylogenetic tree reconstruction 
[9], identifying common gene-set associations [10], inte- 
grating diverse functional genomics data [11], analyzing 
proteome-transcriptome relationships [12], and discover- 
ing patterns in epidemiological research [13]. Statistical 
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approaches have been applied to some of these problems, 
but in many cases a discrete approach is beneficial or 
required because of the structure and diversity of the data 
under study. 

Let us describe a few specific examples. Bicliques have 
been used in the analysis of gene expression data to rep- 
resent subsets of genes and subsets of conditions, each 
pair with a high similarity score [6]. Graph theoretical 
approaches have been proposed in this setting to find 
bicliques in the resultant bipartite graphs that model 
genes and conditions with vertices, and co-expression lev- 
els with edge weights [7,8,14]. Bicliques have been used in 
phylogenetics to improve the accuracy of tree reconstruc- 
tion [9]. Such a tree denotes evolutionary relationships 
among species thought to have a common ancestor. Data 
with no fewer than k genes sampled from no fewer than 
m species are extracted from sequence databases. This 
operation is equivalent to finding maximal bicliques with 
partition sizes at least k and m. Bicliques have been used 
in epidemiological research to identify sets of individuals 
who share common sets of features. Bipartite graphs can 
help capture relationships between organisms and a wide 
range of factors. Maximal bicliques are particularly useful 
in case-control studies involving categorical features such 
as genotypes and exposures [13]. 

Our work has been largely motivated by the compu- 
tational demands of systems like GeneWeaver [11], a 
web-based software platform for the integration of func- 
tional genomics data. GeneWeaver includes a database 
containing lists of genes from diverse sources, along with 
descriptive metadata associated with these lists. Through 
gene homology, the lists can be combined across species 
such that genes on the lists are translated to a common ref- 
erence. This enables the construction of a bipartite graph, 
with vertex partitions representing individual genes ver- 
sus the gene lists. A suite of tools built on the enumer- 
ation of maximal bicliques and other bipartite analyses 
allows the user to identify groups of genes that are asso- 
ciated with related biological functions, all without any 
prior knowledge or assumption about such group associ- 
ations. Efficiency and scalability are paramount, because 
real-time maximal biclique enumeration is required for 
web-based user-driven analyses, as well as for effective 
computations over the entire data repository. 

Problem 

In each of the aforementioned applications involving an 
integration of multiple sets of genome-scale data, bipar- 
tite graphs can be used to represent relationships across 
pairs of heterogeneous data types. An interpretation of 
such a relationship is accomplished through an enumer- 
ation of maximal bicliques. Let us be precise about what 
this means. A bipartite graph is one whose vertices can 
be partitioned into a pair of non-empty, disjoint partitions 



such that no two vertices within the same partition are 
connected by an edge. Let G denote a bipartite graph, let 
U and V denote its two vertex partitions, and let E denote 
its edge set. A biclique in such a graph is a complete bipar- 
tite subgraph, that is, a bipartite subgraph containing all 
permissible edges. The notion is formalized as follows: 

Definition 1. Let G = (U U V,E) denote a bipartite 
graph. A biclique C = {W , V) is a subgraph ofG induced 
by a pair of two disjoint subsets ^ U, c. V, such that 
WueU'.ve V, (u, v) g £. 

A maximum biclique is a largest biclique in a graph. 
Unlike the well-known maximum clique problem, there 
are two distinct variants of the maximum biclique prob- 
lem: the vertex maximum biclique problem and the edge 
maximum biclique problem. The former asks that we 
find a biclique with the largest number of vertices, and 
can be solved in polynomial time [15]. The latter asks 
that we find a biclique with the largest number of edges, 
and is A/^P-complete [16]. In biological applications, the 
edge maximum biclique is often desirable because it mod- 
els more balanced connectivity between the two vertex 
classes. For example, an edge maximum biclique may 
group together numerous related biological processes and 
a modest set of their common genes, whereas a vertex 
maximum biclique may instead group together only a tiny 
set of related biological processes with great numbers of 
common genes. 

A maximal biclique is one not contained in any larger 
biclique. Examples of maximum and maximal bicliques 
are shown in Figure 1. The enumeration version of our 
problem is to find all maximal bicliques in a bipartite 
graph. In so doing, it turns out that we actually gener- 
ate both edge maximum and vertex maximum bicliques. 
Thus, we are chiefly concerned with this enumeration 
problem, formalized as follows: 

Input : A bipartite graph G = {UVJ V,E). 
Output: All maximal bicliques, or subsets of U 
and of V, for which the induced subgraph 
G[U' U V] is complete, and there are no subsets 

2 W and V D V , or 2 W and V" 2 V , 
such that G[U" U V"] is also complete. 

As observed in [17], the maximal biclique enumera- 
tion problem cannot be solved in polynomial time since 
the number of maximal bicliques may be exponential in 
the graph size. Nevertheless, there remains a demand 
for efficiency, because we often need exact solutions to 
large-scale instances in real time. The Maximal Biclique 
Enumeration Algorithm (MBEA) that we will define here 
finds all maximal bicliques. It exploits structure inher- 
ent in bipartite graphs. It employs a branch-and-bound 
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Figure 1 Maximum and maximal bicliques. A bipartite grapli G] lias an edge maximum biclique B] ({U] , uj], {vi , yj, ^3,]) witli 5 vertices and 6 
edges, and a vertex maximum biclique Bj^iui, U4, us, ue, U7}, {1/5}) witli 6 vertices and 5 edges. Botli B] and B2 are maximal. 



technique to prune non-maximal candidates from the 
search tree. Its pruning is accelerated by directly removing 
dominated vertices from the candidate set. Our experi- 
mental results demonstrate that the resultant reduction in 
search space enables MBEA to scale to the tens of thou- 
sands of nodes currently encountered in analyzing large 
biological data sets. In addition, we created an improved 
version, iMBEA, that selects candidate vertices in the 
order of common neighborhood size and that uses an 
enhanced version of branch pruning. 

Related work 

With widespread applications such as those just discussed, 
one would expect a plethora of algorithms targeting max- 
imal bicliques on bipartite graphs. Most algorithms that 
achieve this purpose, however, are either not tailored for 
bipartite graphs or not designed specifically for maximal 
biclique enumerations. Most existing graph algorithms 
for solving this problem fall into two main categories: (i) 
those designed for bipartite graphs but that either place 
undue restrictions on the input or require reduction to 



other problems, and (ii) those designed for general graphs 
and are thus unable to take advantage of bipartite graph 
structure. Table 1 lists these algorithms, their inputs and 
outputs (with restrictions, if any), and the methods they 
use. 

Algorithms for bipartite graphs 

Existing algorithms for finding maximal bicliques in 
bipartite graphs are further divided into the following 
three approaches: exhaustive search with restrictions on 
outputs, reduction to the clique enumeration problem 
on general graphs, and reduction to the frequent itemset 
mining problem in transaction databases. 

The most intuitive approach entails exhaustively build- 
ing all subsets of one vertex partition, finding their inter- 
sections in the other partition, and checking each for 
maximality. Algorithms based on exhaustive search must 
generally place one or more restrictions on the problem to 
reduce its enormous search space. Moreover, exhaustive 
search requires storing generated bicliques to determine 
their maximality. An iterative algorithm is presented in 



Table 1 Previously presented graph algorithms for maximal biclique enumeration 


Algorithms 


Inputs 


Outputs 


Methods 


Sanderson et al. [9] 


Bipartite graph 


Maximal bicliques of 
bounded minimum size 


Exhaustive search by iterative biclique 
building 


Mushlinetal. [13] 


Bipartite graph 


Maximal bicliques of 
bounded sizes and 
fig ure-of- merit values 


Exhaustive search with a priority queue 


Zaki etal. [21] 


Bipartite graph 


Maximal bicliques 
(one partition) 


Frequent closed itemset mining in 
transaction databases 


Uno etal. [25] (LCM/LCM2) 


Bipartite graph 


Maximal bicliques 
(one partition) 


Frequent closed itemset mining 


Li et al. [26] (LCM-MBC) 


Bipartite graph 


Maximal bicliques 


Frequent closed itemset mining 


Makino&Uno [18] 


Bipartite graph 


Maximal bicliques 


Maximal clique finding in general graphs 


Tomita et al. [30] 


General graph 


Maximal cliques 


Maximal clique finding in general graphs 


Eppstein [1 7] 


General graph of 
bounded arboricity 


Maximal bicliques 


Exhaustive search 


Alexe etal. [27] (MICA) 


General graph 


Maximal bicliques 


A consensus algorithm 


Liu etal. [28] (MineMBC) 


General graph 


Maximal bicliques of 
bounded minimum size 


A divide-and-conquer approach 
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[9] to build subsets progressively, from pairs of vertices to 
collections of larger and larger sizes. It limits the sizes of 
both biclique vertex partitions, yet still requires enormous 
amounts memory to store the lists used to generate sub- 
graphs and decide maximality. The algorithm described 
in [13] builds bicliques based on set expansion and exten- 
sion operations. It employs a hash table that determines 
maximality to avoid pair wise biclique comparisons, and a 
queue to maintain bicliques prioritized by figure-of-merit 
values (e.g.,/7-values). Users can specify constraints on the 
figure-of-merit values to filter out bicliques of insufficient 
interest. 

The second approach relies on graph inflation. As 
observed in [18], the enumeration of maximal bicliques in 
a bipartite graph can be transformed into the enumera- 
tion of maximal cliques in a general graph by adding all 
possible edges between vertices within the same partition, 
thereby transforming each of the two disjoint vertex sets 
into a clique. This approach is neither practical nor scal- 
able, however, due to the enormous number of edges that 
may be needed and the concomitant increase in problem 
difficulty that is incurred. Given a bipartite graph G = 
(U U V,E) where \U\ = m, \V\ = n, \E\ = e, the number 
of edges needed to transform G to a corresponding graph 
G is (2) + (^). Thus, this method transforms the problem 

of finding maximal bicliques in a bipartite graph with edge 

e 

density d(G) = to the problem of finding maximal 

m X n 

^+(2) + (2) 

cliques in a graph G with density d(G) = _^ . 

( 2 ) 

Note that G might be dense even if G is sparse. When 

G has two vertex sets of equal size and no edges (i.e. 

2 

\U\ = \V\=n, \E\ = 0), G has a density ^ ~^ 50%. 

— n 

Figure 2(a,b) illustrates the correspondence between these 
two problems. 

A third approach comes from the field of data mining. It 
was observed in [19] that a transactional database can be 
represented by a bipartite graph, with a one-to-one corre- 
spondence between frequent closed itemsets and maximal 
bicliques. A subset of items is defined as a frequent itemset 
if it occurs in at least one transaction. On one hand, a 
frequent itemset and the set of transactions containing 
the frequent itemset form a biclique. On the other hand, 
the adjacency lists of a bipartite graph can be viewed as 
a transaction database by treating each vertex in one set 
as an item and each vertex in the other set as a transac- 
tion that contains a subset of items. A biclique can thus 
be mapped to a frequent itemset. A maximal biclique cor- 
responds to a frequent closed itemset, where a frequent 
itemset / is said to be closed if the set of transactions 
containing / do not contain a superset of /. The support 
of a frequent itemset is the number of transactions in 
which the set occurs. Enumerating all maximal bicliques 



is equivalent to enumerating all frequent closed itemsets 
with support at least 1. Figure 2(a,c) shows a mapping 
between these two problems. A correspondence between 
maximal bicliques of a general graph and frequent closed 
itemsets has been shown [20], leading to the sugges- 
tion that FPclose and similar frequent itemset mining 
methods [21-25] may be helpful in enumerating maximal 
bicliques. Implementations of this approach require a 
post-processing step to obtain the transaction set for 
each frequent closed itemset, as described in [26]. This is 
because the published methods output only the frequent 
itemsets (which correspond to half bicliques). Although 
this post-processing step is straightforward enough, it 
can be prohibitively time-consuming when the number 
of maximal bicliques is large. Moreover, known methods 
take the support level as an input parameter, and find only 
frequent closed itemsets above the given support. (In gen- 
eral, the lower the support, the longer the algorithms take. 
A support of 1 is the most difficult, since at this level all 
frequent closed itemsets must be found.) 

Algorithms for general graphs 

Maximal bicliques can also be found with algorithms 
designed for general graphs. Such algorithms of course 
lack any efficiency gains that might be accrued from 
utilizing bipartite graph structure. The maximal biclique 
enumeration problem was studied from a theoretical 
viewpoint in [17], where the focus was on graphs of 
bounded arboricity. It was proved that all maximal 
bicliques in a graph of order n and arboricity a can be 
enumerated in 0{a^2^^n) time. This approach is not prac- 
tical for large graphs, however, because it is unrealistic 
to expect that arboricity would be limited in practice 
[19]. A suite of consensus algorithms was presented in 
[27] for finding complete bipartite (but not necessarily 
induced) subgraphs. Unfortunately, these algorithms need 
to keep all maximal bicliques in memory. The Modular 
Input Consensus Algorithm (MICA), the most efficient 
among them, has space complexity 0(B) and time com- 
plexity 0(n^B)f where B denotes the number of maximal 
bicliques. An algorithm (MineLMBC) based on divide- 
and-conquer was proposed in [28] to mine large maximal 
bicliques from general graphs by putting size constraints 
on both vertex sets to iteratively prune the search space. 
The algorithm reduces the space complexity to OirP') 
and the time complexity to 0(n^B). The algorithm on 
dense graphs from 2nd DIMACS Challenge benchmarks 
outperforms MICA when minimum biclique sizes are 
constrained by certain thresholds. 

To solve the biclique enumeration problem, restrictions 
on either inputs or outputs have been proposed to reduce 
the search space. These include bounding the maximum 
input degree [7], bounding an inputs arboricity [17], and 
bounding the minimum biclique size [9,28] or figure- 
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Figure 2 Bicliques, cliques, and frequent closed itemsets. The relationship between bicliques in a bipartite graph, cliques in a general graph, 
and frequent closed itemsets in a transaction database is exemplified. In (a), the bipartite graph Gj has a maximal biclique B = ({U] , uj}, {V2, vs}). In 
(b), G2 has been transformed into G2 by adding edges (dashed lines) between every pair of nodes in the same partition. The vertices of B now form 
a maximal clique in G2. In (c), a transaction database is built from G2 by treating V as the transaction set and U as the item set. B can now be viewed 
as a frequent closed itemset in this database. 



of-merit [13]. Naturally, no algorithm relying on these 
restrictions can solve arbitrary bipartite instances. 

Methods 

The algorithm we shall now describe achieves efficiency 
without I/O or other restrictions. Performance testing 
on both synthetic and biological graphs demonstrate that 
it is markedly superior to MICA [27], the best known 
algorithm for finding bicliques on general graphs, and to 
LCM-MBC [26], a state-of-the-art frequent itemset algo- 
rithm that improves upon and adds a post-processing 
step to LCM [25]. The Maximal Biclique Enumeration 
Algorithm (MBEA) combines backtracking with branch- 
and-bound techniques to prune away regions of the search 
tree that cannot lead to maximal bicliques. MBEA is 
inspired by the classic maximal clique-finding method of 
[29], which was refined and shown to have optimal time 
complexity in [30] . The search space for MBEA is limited 
to disjoint vertex sets because, in a biclique, vertices in one 
set determine those in the other. 

Algorithmic basics 

Let G = (LI U V,E) he 2i bipartite graph and assume, 
without loss of generality, that \l[\ > \ V\, MBEA operates 
on the (potentially smaller) set V, utilizing the following 
four dynamically changing sets of vertices: (i) R, a sub- 
set of V, (ii) L, a subset of U containing all the common 
neighbors of (iii) P, a subset of V containing candidate 
vertices that may be added to R, and (iv) Q, a subset of V 
containing former candidates, that is, vertices that were 
previously considered for R. The sets R, L, P and Q are 
employed in a depth-first traversal of a recursion tree to 
form maximal bicliques. R and L are used to form such 
a biclique, where R determines L, P is used for biclique 
expansion. Q is used to determine maximality. P, Q and R 
are required to satisfy the following two conditions: 

• (P n Q) U (P n R) U (Q n R) = 0. That is, P, Q,R 
are pairwise disjoint. 

• PUQ = {v|vG V\R, 3ueL, (u, v) g £}. That is, P 
and Q contain every vertex in V but not R that is 
adjacent to at least one vertex in L. 



Observation 1. The subgraph induced by (L,R) is a 
biclique. 

For simplicity, and since G is bipartite, we henceforth 
drop the reference to induced subgraph, and simply say 
that (L, i?) is a biclique. Note that (L, i?) is maximal iff there 
is no vertex in U\L that is adjacent to all vertices in R 
and no vertex in V\R that is adjacent to all vertices in L. 
Because L is defined by R, only the maximality of R need 
be considered. 

Observation 2. (L, 7?) is maximal iff no vertex in V\R is 
adjacent to every vertex in L 

If P contains a vertex that is adjacent to all vertices in L, 
then (L, R) is not maximal. Thus that vertex may as well 
be moved from P to R, This process can be iterated until 
no more vertices can be so moved. On the other hand, if 
none of the elements of V\R is a common neighbor of all 
vertices in L, then (L, R) is maximal because L and R are 
the largest set of common neighbors of each other. 

Observation 3. Let S denote {v \ v e P 8>l (u,v) e EWu e 
L}. Then (L, RU S) is a maximal biclique. 

If Q contains a vertex that is adjacent to all vertices in L, 
then not only (L, R) is not maximal, but also there can be 
no S as defined above for which (L, RU S) is maximal. We 
can actually say slightly more than this, as follows. 

Observation 4. Let T denote {v | v g Q & (w, v) e EVu e 

L], V denote any subset ofL, and denote any subset of P. 
Unless T is empty, (V, R U is not a maximal biclique. 

Observation 4 is used to prune unproductive subtrees 
in a branch-and-bound style exploration of the maximal 
biclique search space. As Observation 2 shows, if Q con- 
tains a vertex v adjacent to all vertices in L, it means that 
biclique (L,R) is not maximal. We further observe that 
none of the bicliques extended from R contains v, since R 
does not contain v. However, v is adjacent to all vertices in 
any subset of L. Thus, no bicliques extended from such a 
node is maximal and its subtrees can be pruned away. The 
utility of these observations is explicated in Figure 3. 
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Figure 3 Algorithmic observations. Examples are shown to illustrate (a) Observation 1 , (b) Observation 2, (c) Observation 3, and (d) Observation 4. 




Algorithmic description 

To aid discussion, we refer the reader to pidgin pseu- 
docode for Algorithm: MBEA. (For the time being 
we shall ignore starred lines that describe subsequent 
improvements.) Overall, a depth-first search tree traver- 
sal is performed recursively using the core function 
biclique_f ind ( ) . Initially, all vertices are biclique 
candidates {P = V,L = LI), while the biclique and for- 
mer candidate sets are empty {R = Q = 0). As the 
computation proceeds, R grows but L and P shrink. At 
each node of the search tree, biclique_f ind takes as 
input a 4-tuple (L, R, P, Q) and selects a candidate x from 
P. An extension step augments R with x to form R^, and 
forms V from L by removing all vertices not connected 
to X, This makes a set of common neighbors for R\ P^ 
and are then formed by eliminating vertices not con- 
nected to L\ P' also loses vertices connected to all of L. 
These are added to R, If no vertex in Q' is connected to 
all of V , then a maximal biclique {V , R) has been found. 
A recursive call is made with (L\R\P\ Q^. x is removed 
from P and added to Q. The process stops when either 
P is empty or a vertex in Q is connected to all of L. An 
example of the search performed by MBEA is depicted in 
Figure 4. 



Theorem 1. MBEA finds all maximal bicliques in a 
bipartite graph. 

Proof. MBEA explores the entire search space of all 
the subsets of one vertex set and finds all the bicliques 
by Observation 1. It checks their maximality by Obser- 
vation 2. It eliminates only those that cannot lead to 
other maximal bicliques by Observations 3 and 4. There- 
fore, upon termination, MBEA has found all maximal 
bicliques. □ 

Improvements to MBEA 

We seek to improve MBEA in two ways: by an early 
removal of vertices from the candidate set, and by a selec- 
tion of candidate set vertices in non-decreasing order of 
common neighborhood size. Both actions tend to help 
prune the recursion tree by avoiding the generation of 
non-maximal nodes. 

Tree Pruning 

Recall Observation 3, which asserts that if P contains a 
subset S of vertices that are adjacent to all vertices in L, 
then (L, U 5') is a maximal biclique. Our first modification 
is based on an extension of this observation. Although it 
suggests the addition of candidates whose neighborhoods 
contain that of x, upon recursive return MBEA treats 
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vertices in S just as it does other vertices in the candidate 
set. That is, every vertex in S is still selected for expansion, 
even though some will lead to non-maximal subsets only. 
The generation of such branches can be avoided if we sub- 
divide S into two subsets as follows. For any v e S, either 
the neighborhood of v is a proper superset of the neigh- 
borhood oix (i.e., Nl(v) D Nl(x)), or its neighborhood is 
exactly the same as that oix (i.e., Nl(v) = Nl(x)), 

Vertices of the second group can thus be moved directly 
to Q upon recursive return, because any biclique that 
excludes x but includes v is a subgraph of a biclique 
including both x and v. See Figure 5 for an example. 

This construction is formulated as follows: 

Observation 5. Any vertex in P with neighborhood L 
must be an element of the current biclique, and thus can be 
added to R. Otherwise, any biclique in the current subtree 
is non-maximal. 

Candidate selection 

Observe that MBEA chooses candidates in given (arbi- 
trary) order. The second modification we consider was 
inspired by noticing that leftmost branches, which are 
explored earlier, generally have more candidates to gener- 
ate sub-branches than do rightmost branches, which are 
searched later, as long as the selected candidates have the 
same number of connections to L. 

Consider for example a connected bipartite graph G4 = 
(U U !/,£) where \U\ = 4^,\V\ = 3 and vertex vi e V is 
adjacent to all vertices in LI (Figure 5(b)). If vi is the first 
selected candidate, then both V2 and V3 are candidates at 
the same level because both connect to at least one ver- 
tex in U, Both {V2} and {V3} are non-maximal, however. 



since they are subgraphs of bicliques including vi . On the 
other hand, if vi is the last selected candidate, then there 
is no vertex left in the candidate set because V2 and V3 
have been explored earlier. Vertex vi is thus directly added 
to all bicliques according to Observation 3, since vi is 
adjacent to all vertices in L. We conclude that selecting 
candidates in non-decreasing order of common neighbor- 
hood size may avoid generating numerous non-maximal 
subsets. Moreover, it can lead to more balanced recursion 
trees, which is an important property in load-balanced 
parallelization. 

Improved algorithmic details 

To distinguish the basic method from the improved, we 
shall denote the latter by iMBEA, the version incorporat- 
ing the two modifications just discussed. In Algorithm: 
MBEA, these executable additions are indicated with 
starred lines. The vagaries of data are important, natu- 
rally, and so improvements may not always be what they 
seem. For example, an effective way to create and main- 
tain a candidates list ranked by common neighborhood 
size is simply to insert a vertex into its proper place in the 
list. Although well-suited to this particular task, such a use 
of insertion sort may actually create a tradeoff between 
the potential time saved in searching versus that spent 
inserting. To see this, consider that overall time is proba- 
bly saved in the case of real or synthetic graphs with vari- 
able degree distributions. We may actually do better, on 
the other hand, to turn off this feature on highly contrived 
instances, especially those such as regular graphs in which 
all vertices have the same degree. See Figure 6, which 
shows differences between recursion trees produced by 
MBEA and iMBEA on a sample bipartite graph. 
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Algorithm: MBEA (Starred lines apply to iMBEA) 

procedure biclique_f ind (G,L,7?,P, Q) ; 

G: a bipartite graph G = (U U V,E); 

L: set of vertices e U that are common neighbors of 

vertices in i?, initially L = U; 

R: set of vertices € V belonging to the current biclique, 
initially empty; 

P: set of vertices e V that can be added to initially 
P = y, sorted by non-decreasing order of common 
neighborhood size; 

Q: set of vertices used to determine maximality, initially 
empty; 

* / ^ 0; / / Position of selected candidate 

in P 
while P 7^ 0 do 

X ^ P[i -\- -\-]; / / Select next candidate 
from P in order 
R' ^RU {x}; 

V ^{ueL \ (u,x) g£(G)}; 

// Observation 1: extend biclique 
''U ^L\L')C ^ W; 
P'^0;Q'^0;// Create new sets 
/ / Observation 2 : check maximality 
isjnaximal ^ TRUE; 
forall the v in Q do 

A^[v] ^{ueV\ (w,v) g£(G)}; 

// Observation 4: end of branch 

if I ATM I = IL'lthen 

isjnaximal ^ FALSE) 
hreak; 

_ else if \N[v] \ > 0 then Q ^Qu {v}; 

if is_maximal = TRUE then 
forall the v inP,v y^x do 

N[v] ^{ueL' \ {u, v) e £(G)}; // Get the 

neighbors of v in 
if\N[v] I = then 
R' ^R'U {v}; 

// Observation 3: expand 
to maximal 
*5 ^ {uel7\{u,v)eE{G)}; 
* if |5| = 0 then C ^ C U {v}; 

// Observation 5: further 
pruning 

* else if \N[v] \ > 0 then 

*P'^P'U{v}// Insert v into 
in non-decreasing order of 
common neighborhood size 

PRINT(I',i?0;// Report maximal 
biclique 

[ ifP' 7^ 0thenbiclique_find(G,Z^i?^P^QO; 

// Move C from candidate set to former 
candidate set 
_ * Q ^ Q U C; P ^ P \ C; 



Algorithmic complexity 

We first consider the time complexity of a brute-force 
algorithm that examines all subsets of the smaller ver- 
tex partition. Let G = (U U V,E) denote a bipartite 
graph, with \l[\ = m, \ V\ = n, and m > n. There are 
2^ subsets of V. It takes 0(mn) time to find each sub- 
sets common neighbors in U, It also takes 0(mn) time to 
decide maximality. Thus, the worst-case time complexity 
of this simple scheme is 0(2^ mn). 

Similarly, the worst case number of nodes in a recursion 
tree for MBEA is 2^, again bounded by the total number 
of subsets of V. At each such node, the time complexity 
of biclique_f ind is 0(dn), where d is the maximum 
degree of any vertex in V (the maximum number of ver- 
tices in L is d, and the maximum number of vertices in 
P and Q is n). Thus, the worst case time complexity of 
MBEA is 0(2^ dn). The total number of subsets examined 
by MBEA is considerably less than 2^, however, because 
branch-and-bound prunes the recursion tree. We shall 
therefore note that the number of nodes in a recursion tree 
is at least as large the total number of maximal bicliques, 
and analyze time complexity in a fashion similar to that 
performed in [17,27]). 

Lemma 1. Every intermediate node in the recursion 
tree for MBEA represents a distinct maximal biclique. 

Proof Nodes on MBEAs recursion tree represent 
maximal or non-maximal bicliques. Without pruning, a 
non-maximal node may be formed only when a candidate 
or a former candidate s neighborhood is a (not necessar- 
ily proper) superset of the current set L. In the former 
case, candidate vertices (from P) whose neighborhoods 
contain L are automatically added to R by Observation 3. 
Furthermore, if a candidate's neighborhood exactly equals 
1, then no branching is needed based on Observation 5. 
The biclique at any intermediate node is thus maximal 
because further candidate additions would reduce the size 
of L and lead to another maximal biclique. In the latter 
case, a former candidate whose neighborhood contains L 
leads to no more maximal bicliques from that branch. A 
non-maximal node with a former candidate connected to 
all vertices in its L is therefore a leaf. We conclude that all 
intermediate nodes in the recursion tree are maximal. □ 

Theorem 2» Given a bipartite graph G = (U U V,E) 
where \ U\ = m,\V\ = n,m > n, and \E\ = e, the time com- 
plexity of the Maximal Biclique Enumeration Algorithm 
for finding all maximal bicliques in G is 0(eB), where B is 
the number of maximal bicliques. The time complexity per 
maximal biclique is 0(e). 

Proof As proved in Lemma 1, MBEA expands only the 
nodes that are maximal bicliques on the recursion tree, 
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Figure 6 Recursion trees of MBEA versus iMBEA.The recursion trees Ts of MBEA and Tq of iMBEA for bipartite grapli G5. 



which means it creates only maximal bicliques as inter- 
mediate nodes on the tree and non-maximal bicliques can 
only be leaf nodes. In other words, the number of non- 
maximal bicliques created on the tree is at most the total 
number of the leaf nodes. For any intermediate node on 
the recursion tree, the number of its children that are 
leaf nodes representing non-maximal bicliques is less than 
n — 1. In the worst case, the number of intermediate 

(d-l) 

nodes is B = ^ — 1)S and the number of leaves is 

(n — 1)^ = 0(B), where d is the maximum degree of 
any vertex in V, Thus, the total number of nodes on the 
recursion tree is 0(B). We showed that the time complex- 
ity of biclique_f ind ( ) is 0(dn), It can be restated 
as 0(e), since MBEA must scan all edges for maximality 
and biclique expansion in the worst case. Combining time 
complexity 0(e) at each node with the total number of 
nodes in the recursion tree 0(B), we obtain a time com- 
plexity of 0(eB) for the overall algorithm, and a time 
complexity per maximal biclique of 0(e), □ 

To understand the algorithmic complexity a little 
deeper, we view MBEA under the concept of delay time, 
which we define as in [31] as the running time between 
the output of two consecutive maximal bicliques. In this 
framework, MBEA is a "polynomial delay time algorithm" 
because the elapsed time between the output of any two 
consecutive bicliques is polynomial in d and n. 

Theorem 3. MBEA is a polynomial delay time algorithm 
with delay complexity 0(d^y?). 

Proof, MBEA takes 0(dn) time to explore any single 
node in its recursion tree. A maximal (intermediate) node 
can have at most n — \ non-maximal neighbors (leaves). 
Even in the worst case, MBEA must traverse no more than 
back to the root of the tree to find the next maximal node. 
The depth of the tree is at most d. From this it follows that 
the delay complexity is 0(d^rP'), □ 



Theorem 4. Given a bipartite graph G = (U ^ V,E) 
where \U\ = m,\V\ = n,m > n, the worst- case space 
complexity of MBEA is 0(min(d, n)m). 

Proof, MBEA uses two vectors to store the two vertex 
sets of the biclique in each node of the recursion tree. The 
space for storing them is 0(m + n). When m > n, the 
space complexity at each node is 0(m), Since the depth of 
the tree is at most d, the overall space complexity of the 
recursion tree is 0(dm), Meanwhile, MBEA uses bitmap 
vectors to store adjacency matrix of the input bipartite 
graph, which requires 0(nm) space complexity. Therefore, 
the space complexity in total is 0(nm + dm) = 0((n + 
d)m) = 0(min(d,n)m), □ 

Thus, in the worst case, MBEAs space complexity is 
quadratic in the order of the graph. This should not be 
surprising. Indeed, such a result is the best that can be 
achieved by any algorithm that stores its entire input, 
since the input size is determined by the number of edges. 

Implementations and testing 

We implemented MBEA and iMBEA and compared 
them to existing implementations of what should be 
the two strongest competitors: MICA [27], currently 
the fastest graph theoretical algorithm for finding 
bicliques in general graphs, and LCM-MBC [26], currently 
among the most advanced data mining algorithms for 
finding pairs of frequent closed patterns, improving 
upon LCM [25]. An efficient implementation of MICA 
is available at http://genome.cs.iastate.edu/supertree/ 
download/biclique/README.html. Efficient codes for 
LCM can be found at http://fimi.ua.ac.be/src/. Version 
2 is reported to be the faster of the two available LCM 
implementations. The authors of [26] graciously pro- 
vided us with their implementation of LCM-MBC, which 
we used in our comparisons. MBEA/iMBEA and MICA 
accept graphs in a simplified DIMACS edge list format. 
LCM/LCM-MBC is not DIMACS compatible, however. 
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and required us to convert an edge list into an equivalent 
adjacency list for the smaller bipartite partition. Graphs 
come in many formats, of course, so we did not charge any 
time for this simple conversion. 

All implementations were compiled on and timings per- 
formed under the Ubuntu 12.04 (Precise Pangolin) x64 
operating system on a Dell OptiPlex 9010 Minitower with 
an Intel Core 17-3770 3.4 GHz processor, 16.0 GB DDR3 
non-ECC SDRAM memory at 1600 MHz (4 DIMMs), and 
a 500 GB 7200 RPM SATA hard drive. Only sequential 
implementations of MBEA, MICA and LCM-MBC were 
compared, each making use of a single compute core. 
MBEA and iMBEA were written in C and compiled with 
the GNU gcc compiler with 03 optimization turned on. 
The MICA and LCM-MBC implementations were also 
compUed with 03 turned on. The wallclock running times 
we report include both I/O and computation, but exclude 
the time taken to print out the maximal bicliques. They 
are the average of ten, five or three runs for graphs that 
can be finished within one minute, one hour or three days, 
respectively. Runs that exceeded three days were killed 
and omitted from the averages. We employed standard 
data reduction techniques to reduce the size of bipar- 
tite graphs for all methods tested. For example, during 
pre-processing, two or more vertices with the same neigh- 
borhood are merged into a single vertex; this process is 
reversed at post-processing. 

Biological graphs 

We tested the algorithms on biological graphs derived 
from functional genomics data. One set of graphs, which 
was extracted from cerebellum data, was created using 
a matrix of correlation p-vslues for gene expression to 
phenotypes across strains of mice in a single popula- 
tion [32]. The matrix consists of 45137 genes represented 
by microarray measures of transcript abundance and 
782 phenotypes to which the transcript abundances are 
correlated. A bipartite graph is obtained by placing an 
edge only where the correlation /?-value is at or below 
some preset threshold. The density of this graph can 
be varied by adjusting the threshold. The lower the p- 
value threshold, the lower the graph density. To test a 
wide variety of densities, we created twenty graphs over 
a range of thresholds, from 0.01 to 0.20, with a step 
of 0.01. 

The second set of graphs, which represent phenotype- 
gene associations, was created from a correlation matrix 
between 33 phenotypes and 17539 genes, calculated over 
a panel of more than 300 mice. For each threshold, 
a phenotype-gene edge is present if the correlation is 
at or above the threshold. We created graphs with a 
range of thresholds, so that the lowest threshold ran in 
a small fraction of a second and the largest in tens of 
minutes. 



In both sets, edge density increases across the range 
of thresholds, from roughly 0.2% to about 2.5% in the 
cerebellum graphs, and from roughly 6.6% to as high as 
37.4% in the pheno-gene graphs. Computational demands 
increase even more rapidly, because the number of max- 
imal bicliques tends to grow exponentially with a linear 
increase in threshold values. 

Random graphs 

In addition to biological graphs, we tested iMBEA and 
LCM-MBC on random bipartite graphs, using two differ- 
ent random graph models. The first is the classic Erdos- 
Renyi random graph model. Here, we fixed the number 
of vertices in each partition at 300 and varied the den- 
sity from 0.1 to 0.28. The density range was selected so 
that the lowest would run in well under a second and 
the highest would require several minutes. We also tested 
graphs with 400 and 500 vertices, but the results were sim- 
ilar enough to graphs with 300 vertices that we omit their 
discussion. 

For the second random graph model, we modified the 
Erdos-Renyi model so that we could study graphs with 
both high and low degree variability. The graph genera- 
tor takes as input these four parameters: the size m of 
the larger partition, the size n of the smaller partition, the 
average vertex degree fi in the smaller partition, and the 
coefficient of variation CV of the degrees in the smaller 
partition. (Recall that CV = a/fi, where a is the stan- 
dard deviation and /x is the mean.) These specifications 
were used to assign vertex degrees to the smaller partition. 
No edges were produced within a partition, of course. 
The assigned degrees in the smaller partition were used 
to place edges, selecting each endpoint in the larger par- 
tition with uniform probability. For example, if a vertex in 
the smaller partition had been assigned degree three, then 
three neighbors for it were uniformly selected from the 
larger partition. 

We created three sets of random graphs with this graph 
generator. The first set fixed the number of vertices in 
one partition at 10,000 and in the other partition at 1000, 
the edge density at 4.5%, and varied the CV from 0.3 to 
1.2. The purpose of this set was to test the behavior of 
MBEA versus iMBEA when the CV is varied, it being our 
intuition that iMBEA might be better suited to graphs 
with higher CV, The second and third sets of graphs were 
created to test iMBEA versus LCM-MBC when the rela- 
tive partition sizes were varied. In one set, the size of the 
larger partition is fixed at 10,000 and the size of the smaller 
partition is varied from 100 to 1000. In the other set, the 
size of the smaller partition is fixed at 500 and the size of 
the larger partition is varied from 5000 to 50,000. In both 
sets we used an edge density of 3.0%, which provided a 
wide spectrum of partition sizes while keeping runtimes 
within reason. 
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Results and discussion 

In this section, we compare runtimes of the various algo- 
rithms. MICA turns out not to be competitive on any of 
our graphs. We therefore exclude its timings from our 
presentation. For instance, iMBEA outperforms MICA by 
more than three orders of magnitude on even modest- 
sized biological graphs. On a somewhat larger graph, 
iMBEA finishes in under an hour while MICA runs 
for over three days without completion. And on the 
largest graphs, MICA runs out of memory. Thus, we 
feel it is manifest that MICA does not belong in the 
same class as algorithms such as MBEA and iMBEA, 
which are specifically targeted at bipartite graphs. We 
first concentrate on MBEA and iMBEA on both bio- 
logical and random graphs in order to demonstrate the 
performance gained by iMBEAs improved pruning. We 
then move on to compare iMBEA and LCM-MBC on 
two sets of biological graphs and three sets of random 
graphs. 

Comparison of MBEA and iMBEA 

In Figure 7 we compare the runtimes of MBEA and 
iMBEA on the twenty cerebellum graphs. The curves 
cross at a p-vdlue threshold of about 0.07. iMBEA is 
roughly three times as fast as MBEA at around threshold 
0.20. These results confirm our expectations that the rel- 
ative simplicity of MBEA wins on sparse graphs produced 
at lower thresholds, while the improvement overhead of 
iMBEA more than pays for itself once higher thresholds 
generate graphs that are sufficiently dense. 

We also compared MBEA and iMBEA on random bipar- 
tite graphs. As shown in Figure 8, while reasonably close, 
iMBEA consistently outperforms MBEA. The sorted can- 
didate vertex selection and enhanced pruning of iMBEA 



appear still to produce performance gains. These gains 
are not as significant, however, as they were for biologi- 
cal graphs. This may be due at least in part to the rather 
smoothed overall topology of random graphs, as opposed 
to the uneven density and highly irregular features typ- 
ically seen in graphs like those in GeneWeaver. To look 
closer into this behavior, we varied the CV with which 
random graphs were built. We found, as illustrated in 
Figure 9, that iMBEA outperforms MBEA on random 
bipartite graphs over the entire CV range tested. The 
performance gap is smaller when the CV is low, proba- 
bly due to MBEAs relative simplicity and reduced over- 
head. As the CV increases, however, the performance gap 
between MBEA and iMBEA widens. These results help 
explain iMBEAs superior performance on biologically- 
derived graphs, which very often exhibit high variation 
in vertex degree. When comparing our algorithms to 
other methods, we employ only iMBEA for simplicity. It 
is possible that on some inputs MBEA would do slightly 
better. 

Comparison of iMBEA and LCM-MBC 

Figure 10 shows the average runtimes of iMBEA and 
LCM-MBC on the biological graphs tested. Part (a) is 
the pheno-gene graphs, and parts (b) and (c) are two 
ranges of p- values for the cerebellum graphs. The perfor- 
mance disparity is most notable when the graphs grow 
denser. On both the cerebellum and pheno-gene graphs, 
the maximal bicliques in the densest graph exceed the 2 
GB disk storage limit of the LCM-MBC implementation, 
causing the program to halt prematurely, reporting only 
a portion of the maximal bicliques. The runtime of these 
two graphs would certainly be much higher if the limit 
were removed. The results of iMBEA and LCM-MBC on 
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Figure 7 Performance of MBEA versus IMBEA on biological graphs. Performance comparison of MBEA and iMBEA on 20 cerebellum graphs 
from GeneWeaver. As the size and density of the graphs increases, the small overhead incurred by iMBEA's pruning checks is quickly rewarded with 
performance gains from the additional pruning. 
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Figure 8 Performance of MBEA versus IMBEA on random graphs. Although runtimes are close, IMBEA consistently outperforms MBEA on 
random graphs. 



random bipartite graphs are shown in Figure 11. Both 
methods scale to graphs with thousands of vertices in each 
partition. The iMBEA algorithm, however, consistently 
and convincingly outperforms LCM-MBC. 

These figures highlight iMBEAs advantages in scal- 
ability. Methods tend not to look very different when 
graphs are sparse. As data quality improves, however, 
GeneWeaver and analysis tools of its ilk tend to employ 
denser graphs in order to capture deeper latent structure. 
This is where the design enhancements of iMBEA really 
start to become conspicuous and unmistakable. 

Utility in GeneWeaver 

GeneWeaver (http://geneweaver.org), formerly the Onto- 
logical Discovery Environment [11], seeks to identify 



unique and shared relationships between genes and their 
roles in biological processes. Aggregated genomic data 
is integrated, and relevant associations are represented, 
with discrete bipartite graphs. These allow relation- 
ships from diverse experimental sources to be combined. 
GeneWeaver employs MBEA/iMBEA on these graphs 
to discover the ontology or structured inheritance of 
biological processes through the genesets that support 
them. This is accomplished through an enumeration of 
maximal bicliques, which are organized as a directed 
acyclic graph (DAG) to form an empirically derived 
interpretation of relationships between biological pro- 
cesses. An implementation of this systematic approach, 
including MBEA/iMBEA, is embedded in the web-based 
GeneWeaver software platform. Data availability has 
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Figure 9 Effect of graph degree structure on MBEA and IMBEA. The average delay time of MBEA and iMBEA on random graphs with the same 
size and density, but varying degree distribution. On graphs with low coefficient of variation, the performance gap between MBEA and iMBEA is 
narrower than on graphs with high coefficient of variation. This confirms our expectation that the pruning enhancements of iMBEA have a larger 
effect on graphs with diverse degree structure. 
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Figure 10 Performance of iMBEA and LCM-MBC on GeneWeaver graphis.The GeneWeaver graphs were constructed from two different 
phenotype-gene similarity matrices. Eacli edge is eitlier present or absent based on wlietlier it is at or above (or at or below, when p-values are 
used) a given threshold. The graphs in (a) were created from a correlation matrix of 33 phenotypes and 1 7539 genes. Graphs in (b) and (c) were 
created from a matrix of correlation p-values for gene expression to phenotypes in a single mouse population, using 782 phenotypes and 45137 
genes. As the threshold moves to the right along the x-axis, the graphs generally grow larger and denser. The pheno-gene graphs range from 6.6% 
to 34.7% density, while the cerebellum graphs range from about 0.2% to about 2.5% density. 
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Figure 1 1 Performance of iMBEA and LCM-MBC on random bipartite graphs. The Erdos-Renyi random bipartite graplis in (a) liave tlie number 
of vertices in eacli partition fixed at 300, but tine density is varied from 0.1 to 0.28, sliowing liow density affects runtime. Similar results on graphs 
with each partition fixed at 400 and 500 vertices are omitted for space considerations. The graphs in (b) and (c) were generated using the random 
graph generator described in the text, with CV of 1 .0 and density of 0.03. In (b), the size of the larger partition is fixed at 10,000 while the size of the 
smaller partition is varied. In (c) the converse occurs; the size of the smaller partition is fixed at 500 while the size of the larger partition is varied. In all 
three cases, the performance disparity between iMBEA and LCM-MBC is apparent. 
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driven this application to emphasize genes as the pri- 
mary biological entity through which relationships are 
inferred. Nevertheless, the model is general enough to 
map easily onto other biological entities or attributes. 
Thus, GeneWeaver provides a computationally scalable 
approach to subset-subset matching in the quest to 
increase our understanding of molecular networks that 
support biological function. 

Motivation 

A major challenge in bioinformatics is to identify relation- 
ships among poorly characterized genes and their varied 
roles in biological processes, and to group these pro- 
cesses along functionally meaningful lines. For example, 
one may be interested in whether (and which of) the bio- 
logical bases of psychiatric disorders such as anxiety or 
depression are also involved in alcoholism. Each disorder 
may be attributable to multiple genes, and each gene may 
be involved in multiple disorders (pleiotropy). Biological 
processes are typically categorized by ontologists based on 
their external manifestations. Unfortunately, phenomena 
such as convergent evolution (when two similar structures 
or functions are obtained through distinct evolutionary 
processes) and other factors that result in functional sim- 
ilarity can lead to poor classification schemes that do 
not map onto the supporting biology. Thus, for basic 
research into discovery of the biological underpinnings of 
diverse processes, a classification of biological functions 
can instead be based on sets of underlying genes. Finite 
simple graphs are a natural way to represent relationships 
between such sets. Graph algorithms are a useful tool in 
their analysis and interpretation. The need to study whole 
genome versus biological functional data makes bipar- 
tite graphs an appropriate model for finding associations 
between pairs of disparate data types. Enormous corre- 
lational structures can arise in data of this size, however, 
potentially making the task of biclique enumeration a lim- 
iting computational bottleneck. This is because classifica- 
tion and assessment of the phenome space is theoretically 
unbounded, especially in the case of genome-scale onto- 
logical discovery. The MBEA and iMBEA methods were 
therefore developed to harness fast algorithm design tech- 
niques and to exploit bipartite graph structure in order 
to satisfy the staggering computational demands that 
may be incurred in the creation of emergent phenotypic 
ontologies. 

Data 

A biological pathway or process can be associated with 
a set of genes. Such a set typically comes from some 
biological source, for example, an experiment related to 
drug abuse. Gene sets can be generated with any method- 
ology dedicated to gene-network creation. Commonly 
used methods include differential gene expression, genetic 



correlation to gene expression, positional candidates from 
genetic mapping, associations obtained from text mining, 
and literature reviews and/or empirical studies in which 
researchers compile gene lists involved in various behav- 
ioral constructs such as pain, aggression, alcoholism and 
drug abuse. 

The GeneWeaver database currently contains over 
75,000 gene sets covering nine species: Caenorhabditis 
elegans (roundworm), Danio rerio (zebrafish), Drosophila 
melanogaster (fruit fly), G alius gallus domesticus 
(chicken). Homo sapiens (human), Macaca mulatta 
(monkey), Mus musculus (mouse), Rattus norvegicus (rat) 
and Saccharomyces cerevisiae (yeast). When sets from 
different species are combined, gene homology is used 
to match genes onto a set of reference gene ID clusters. 
Although the gene-set space is unlimited, the genome 
space is constrained by the finiteness of the genome itself. 
(The human genome, for example, is currently estimated 
to contain roughly 20,000-25,000 genes.) It should be 
noted that the method described here is extensible to 
include any biomolecule associated with a function or 
process, including miRNA, transcript forms, gene prod- 
ucts and their various states and many additional entities 
involved in biological processes. Likewise, it is often desir- 
able to use abundance or co-occurrence statistics to relate 
one class of biomolecules to another, including transcripts 
and miRNA, or transcripts and proteins. Thus, the size of 
the biomolecular vertex class is also potentially without 
bound. 

Model 

A biclique-based model was developed to extract func- 
tions along with functionally similar genes from gene 
sets derived from various sources, and then to orga- 
nize them as a DAG to represent an entire ontology 
of biological functions. This model consists of three 
major components: a combine module to compute gene- 
set association matrices to construct bipartite graphs 
via thresholding and graph mapping, a biclique module 
using MBEA/iMBEA to enumerate maximal bicliques 
from gene-set bipartite association graphs, and a phenome 
graph module to organize gene sets by integrating max- 
imal bicliques into a DAG to represent an ontology of 
functions. 

Graphs 

The combine module melds gene sets from various 
sources, computes a real-valued scoring matrix to asso- 
ciate genes with functions, converts the matrix to binary 
by applying a suitable threshold, and transforms the 
matrix into a bipartite association graph. Homology may 
be employed when more than one species is involved. 
Scoring can be based on a variety of statistical metrics, 
including correlation coefficients, p oi q values, literature 
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Figure 12 The hierarchical similarity graph. Creation of the hierarchical similarity graph of gene sets: (a) the source gene sets, (b) the gene-set 
bipartite graph, (c) the gene-set bipartite graph with two of its maximal bicliques highlighted, ({D/?D2}, {GSl 256, GSl 1 32}) and ({NPY, PDYN, 
CYP2D6}, {GSl 132, GSl 229}), and (d) integration of all maximal bicliques to a DAG of function relationships. In this graph, the two highlighted 
maximal bicliques from (b) are roots. The maximal biclique ({25 genes}, {GSl 1 32}) is a child of both, since {GSl 1 32} is a subset of the gene sets in 
both roots. Geneset {GSl 1 32} is associated with the genes DRD2 and NPY, PDYN, CYP2D6 as its parents, but is also connected to 21 genes not shown. 



associations and other categorical analyses. Threshold- 
ing may be soft or hard, and is generally performed with 
the aid of low and high pass filters. Keywords such as 
"drug" "alcohol" and "cerebellum" are used to select gene 
sets, based on search term occurrence in metadata. These 
sets may be fused to form larger collections of putative 
biological functions. 

Biclique enumeration 

The biclique module uses MBEA/iMBEA to enumerate 
all maximal bicliques in the bipartite gene-set association 
graph. Here a biclique represents the relationship between 
a set of biological functions and the genes with which they 
are commonly associated. Maximality ensures that this 



relationship is not properly contained within another. A 
maximal biclique thus denotes a unique set of function- 
ally similar biological processes along with the genes they 
share in common. 

Ontologlcal Integration 

The phenome graph module constructs an ontology of 
functions. Maximal bicliques are connected based on 
their relationships. The resultant hierarchical similarity 
graph represents sets of genes associated with common 
functions. Note that DAGs are similar to hierarchies 
(forests), except that a child node in a DAG may have 
more than one parent node. The formulation of the 
hierarchical similarity graph is based on the following 
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Table 2 Gene sets extracted from GeneWeaver 

GeneWeaver IDs Descriptions # Genes 

G51 1 32 Addiction candidate genes derived from literature review [33] 25 

GS1 229 Differential gene expression among heroine and cocaine abusers [34] 693 

GS1 256 Gene expression in hippocampus from human cocaine abusers [35] 38 



observation, which helps us define an inherent biclique 
ordering. 

Observation 6. Let P(b) denote the set ofphenotypes in 
a biclique, b, and let Gib) denote its set of genes. Given two 
maximal bicliques bi and b2, P{bi) C ^(^2) if and only if 
G(bi) D G{b2\ andPibi) D P(Z?2) if and only if G{bi) C 
G{b2\ 

We can now define a hierarchical similarity graph using 
maximal bicliques for nodes and a partial ordering of the 
bicliques for arcs (directed edges). Node bi will be an 
ancestor of node ^2 iff ^(^1) ^ ^(^2)- In this case we say 
that Z?2 is a descendant of A node with no ancestors is 
said to be a root. One with no descendants is said to be 
a leaf. Node bi will be a parent of ^2 iff it is an ancestor 
and there is no other node Z73 for which P{b\) D Pib^) 
and Pibs) D ^(^2)- In this case we say that ^2 is a child of 
bi. Once these relationships have been formed, an arc is 
placed from a parent to each of its children. 

Figure 12 illustrates this construction. A sample hierar- 
chical similarity graph is built from three human gene sets 
taken from GeneWeaver using the drug-related gene sets 
listed in Table 2. These sets contain many genes, but we 
are chiefly interested in the ten genes that are each shared 
by at least two of the sets. These genes and sets are used 
to build a gene-set association graph, from which a total 
of six maximal bicliques are extracted. 

Despite GeneWeaver 's size and scope, MBEA/iMBEA 
currently requires at most a few minutes to enumerate 
maximal bicUques on legitimate queries. A more subtle 
but equally important task that it must perform is the 
computation of significance levels for DAG scoring (based 
on factors such as height and width) among graphs with 
the same number of genes, gene sets and gene-set asso- 
ciations. Here a re-sampling procedure can be applied 
to simulate variations in gene-set intersection topology. 
Such a procedure can easily require tens of thousands of 
re-sampling operations, however, each needing its own 
list of maximal bicliques. MBEA/iMBEA can accomplish 
this task in less than an hour using current technologies, 
while previous methods were untenable, often consuming 
several days even on just a few hundred gene sets. 

Conclusions 

We introduced a novel algorithm, MBEA, to enumerate 
maximal bicliques in a bipartite graph. The technique 



we described employs efficient branching and pruning 
strategies to eliminate paths that cannot lead to maxi- 
mal bicliques. We also presented an improved version 
of this algorithm, iMBEA, that selects candidate vertices 
in non-decreasing order of common neighborhood size. 
Extensive empirical evaluation revealed that iMBEA out- 
performs MBEA on both biological and random graphs. 
Furthermore, we tested iMBEA against MICA, a fast 
consensus algorithm, and against LCM-MBC, a fre- 
quent closed itemset data mining method. We observed 
that both iMBEA and LCM-MBC are orders of mag- 
nitude faster than MICA, which we thus eliminated 
from further consideration. We also found that iMBEA 
is significantly faster than LCM-MBC, on both ran- 
dom graphs and biologically-based graphs derived from 
GeneWeaver (http://geneweaver.org), an online system 
for the integration of functional genomics experiments. 
Armed with iMBEA, GeneWeaver provides users with the 
computational capacity to perform genome-scale analyses 
of complex relationships derived from diverse biological 
experiments, with the goal to discover the ontology or 
structured inheritance of biological processes. MBEA and 
iMBEA are apt to be well suited to any application in 
which bipartite graphs can be used to model relationships 
between two sets of diverse items. 
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